This study presents a comprehensive transcriptomic analysis of lung tissue from individuals with chronic obstructive pulmonary disease (COPD) compared to controls using the GEO dataset GSE76925. After quality control filtering based on probe detection p-values and log2 transformation of intensities, unsupervised principal component analysis revealed structured transcriptional variation across samples. While the dominant variance axis (PC1) was not directly associated with disease status, a secondary axis (PC2) demonstrated significant separation between COPD and control groups, suggesting the presence of a robust but heterogeneous disease-associated transcriptional signal.
Supervised differential expression modelling using limma with empirical Bayes moderation identified widespread gene dysregulation in COPD lung tissue. Pathway-level analysis through Gene Ontology, Reactome over-representation analysis, and rank-based Gene Set Enrichment Analysis (GSEA) consistently highlighted coordinated upregulation of extracellular matrix remodeling and SLIT–ROBO signaling pathways, alongside downregulation of translational and ribosomal biogenesis programs. Together, these findings indicate that COPD is characterised not by isolated gene-level perturbations but by systematic reprogramming of structural and biosynthetic pathways, consistent with chronic tissue injury and airway remodeling.
This notebook analyses the microarray dataset
GSE76925 (COPD vs control lung tissue).
The raw file includes two types of columns:
# CRAN + Bioconductor package helper
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
load_or_install <- function(pkgs, bioc = FALSE) {
for (p in pkgs) {
if (!requireNamespace(p, quietly = TRUE)) {
if (bioc) BiocManager::install(p, ask = FALSE, update = FALSE) else install.packages(p)
}
suppressPackageStartupMessages(library(p, character.only = TRUE))
}
}
if (!requireNamespace("pheatmap", quietly = TRUE)) install.packages("pheatmap")
# CRAN packages
load_or_install(c("tidyverse", "data.table", "pheatmap"))
# Bioconductor packages
load_or_install(c("GEOquery", "limma", "clusterProfiler", "org.Hs.eg.db", "enrichplot", "ReactomePA"),
bioc = TRUE)
# Read gz file directly
expr_raw <- read.delim(
"GSE76925_non-normalized (2).txt.gz",
header = TRUE,
sep = "\t",
check.names = FALSE
)
dim(expr_raw)
## [1] 47249 302
head(expr_raw[, 1:6])
Interpretation: columns such as sample1
are intensities; columns such as sample1.Detection.Pval are
QC flags for that probe in that sample.
What we do: separate the raw table into two matrices.
Why it matters: detection p-values are not expression. If you leave them in the expression matrix, PCA, differential expression, and clustering become meaningless.
# Expression columns are those NOT ending in ".Detection.Pval"
expr_mat <- expr_raw[, !grepl("Detection\\.Pval$", colnames(expr_raw)), drop = FALSE]
# Detection p-value columns DO end in ".Detection.Pval"
detp_mat <- expr_raw[, grepl("Detection\\.Pval$", colnames(expr_raw)), drop = FALSE]
# Sanity checks
dim(expr_mat)
## [1] 47249 151
dim(detp_mat)
## [1] 47249 151
head(colnames(expr_mat), 6)
## [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6"
head(colnames(detp_mat), 6)
## [1] "sample1.Detection.Pval" "sample2.Detection.Pval" "sample3.Detection.Pval"
## [4] "sample4.Detection.Pval" "sample5.Detection.Pval" "sample6.Detection.Pval"
What we do: keep probes detected (detection p-value < 0.01) in at least 20% of samples.
Why it matters: microarrays include probes that are effectively noise in a given tissue. Keeping them inflates the multiple testing burden and adds noise to PCA and modelling.
detected <- detp_mat < 0.01
keep <- rowMeans(detected, na.rm = TRUE) >= 0.20
expr_f <- expr_mat[keep, , drop = FALSE]
dim(expr_f)
## [1] 21618 151
What we do: log2(x + 1).
Why it matters: intensities are right-skewed and variance increases with the mean. Log transformation stabilises variance and improves linear modelling assumptions.
expr_log <- log2(expr_f + 1)
summary(as.vector(expr_log))
## Length Class Mode
## sample1 21618 -none- numeric
## sample2 21618 -none- numeric
## sample3 21618 -none- numeric
## sample4 21618 -none- numeric
## sample5 21618 -none- numeric
## sample6 21618 -none- numeric
## sample7 21618 -none- numeric
## sample8 21618 -none- numeric
## sample9 21618 -none- numeric
## sample10 21618 -none- numeric
## sample11 21618 -none- numeric
## sample12 21618 -none- numeric
## sample13 21618 -none- numeric
## sample14 21618 -none- numeric
## sample15 21618 -none- numeric
## sample16 21618 -none- numeric
## sample17 21618 -none- numeric
## sample18 21618 -none- numeric
## sample19 21618 -none- numeric
## sample20 21618 -none- numeric
## sample21 21618 -none- numeric
## sample22 21618 -none- numeric
## sample23 21618 -none- numeric
## sample24 21618 -none- numeric
## sample25 21618 -none- numeric
## sample26 21618 -none- numeric
## sample27 21618 -none- numeric
## sample28 21618 -none- numeric
## sample29 21618 -none- numeric
## sample30 21618 -none- numeric
## sample31 21618 -none- numeric
## sample32 21618 -none- numeric
## sample33 21618 -none- numeric
## sample34 21618 -none- numeric
## sample35 21618 -none- numeric
## sample36 21618 -none- numeric
## sample37 21618 -none- numeric
## sample38 21618 -none- numeric
## sample39 21618 -none- numeric
## sample40 21618 -none- numeric
## sample41 21618 -none- numeric
## sample42 21618 -none- numeric
## sample43 21618 -none- numeric
## sample44 21618 -none- numeric
## sample45 21618 -none- numeric
## sample46 21618 -none- numeric
## sample47 21618 -none- numeric
## sample48 21618 -none- numeric
## sample49 21618 -none- numeric
## sample50 21618 -none- numeric
## sample51 21618 -none- numeric
## sample52 21618 -none- numeric
## sample53 21618 -none- numeric
## sample54 21618 -none- numeric
## sample55 21618 -none- numeric
## sample56 21618 -none- numeric
## sample57 21618 -none- numeric
## sample58 21618 -none- numeric
## sample59 21618 -none- numeric
## sample60 21618 -none- numeric
## sample61 21618 -none- numeric
## sample62 21618 -none- numeric
## sample63 21618 -none- numeric
## sample64 21618 -none- numeric
## sample65 21618 -none- numeric
## sample66 21618 -none- numeric
## sample67 21618 -none- numeric
## sample68 21618 -none- numeric
## sample69 21618 -none- numeric
## sample70 21618 -none- numeric
## sample71 21618 -none- numeric
## sample72 21618 -none- numeric
## sample73 21618 -none- numeric
## sample74 21618 -none- numeric
## sample75 21618 -none- numeric
## sample76 21618 -none- numeric
## sample77 21618 -none- numeric
## sample78 21618 -none- numeric
## sample79 21618 -none- numeric
## sample80 21618 -none- numeric
## sample81 21618 -none- numeric
## sample82 21618 -none- numeric
## sample83 21618 -none- numeric
## sample84 21618 -none- numeric
## sample85 21618 -none- numeric
## sample86 21618 -none- numeric
## sample87 21618 -none- numeric
## sample88 21618 -none- numeric
## sample89 21618 -none- numeric
## sample90 21618 -none- numeric
## sample91 21618 -none- numeric
## sample92 21618 -none- numeric
## sample93 21618 -none- numeric
## sample94 21618 -none- numeric
## sample95 21618 -none- numeric
## sample96 21618 -none- numeric
## sample97 21618 -none- numeric
## sample98 21618 -none- numeric
## sample99 21618 -none- numeric
## sample100 21618 -none- numeric
## sample101 21618 -none- numeric
## sample102 21618 -none- numeric
## sample103 21618 -none- numeric
## sample104 21618 -none- numeric
## sample105 21618 -none- numeric
## sample106 21618 -none- numeric
## sample107 21618 -none- numeric
## sample108 21618 -none- numeric
## sample109 21618 -none- numeric
## sample110 21618 -none- numeric
## sample111 21618 -none- numeric
## sample112 21618 -none- numeric
## sample113 21618 -none- numeric
## sample114 21618 -none- numeric
## sample115 21618 -none- numeric
## sample116 21618 -none- numeric
## sample117 21618 -none- numeric
## sample118 21618 -none- numeric
## sample119 21618 -none- numeric
## sample120 21618 -none- numeric
## sample121 21618 -none- numeric
## sample122 21618 -none- numeric
## sample123 21618 -none- numeric
## sample124 21618 -none- numeric
## sample125 21618 -none- numeric
## sample126 21618 -none- numeric
## sample127 21618 -none- numeric
## sample128 21618 -none- numeric
## sample129 21618 -none- numeric
## sample130 21618 -none- numeric
## sample131 21618 -none- numeric
## sample132 21618 -none- numeric
## sample133 21618 -none- numeric
## sample134 21618 -none- numeric
## sample135 21618 -none- numeric
## sample136 21618 -none- numeric
## sample137 21618 -none- numeric
## sample138 21618 -none- numeric
## sample139 21618 -none- numeric
## sample140 21618 -none- numeric
## sample141 21618 -none- numeric
## sample142 21618 -none- numeric
## sample143 21618 -none- numeric
## sample144 21618 -none- numeric
## sample145 21618 -none- numeric
## sample146 21618 -none- numeric
## sample147 21618 -none- numeric
## sample148 21618 -none- numeric
## sample149 21618 -none- numeric
## sample150 21618 -none- numeric
## sample151 21618 -none- numeric
Differential expression needs phenotype labels (COPD vs control). We pull these from GEO.
gse <- getGEO("GSE76925", GSEMatrix = TRUE)
eset <- gse[[1]]
pheno <- pData(eset)
dim(pheno)
## [1] 151 58
colnames(pheno)[1:20]
## [1] "title" "geo_accession" "status"
## [4] "submission_date" "last_update_date" "type"
## [7] "channel_count" "source_name_ch1" "organism_ch1"
## [10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
## [16] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
## [19] "characteristics_ch1.9" "characteristics_ch1.10"
head(pheno[, 1:10])
Goal: make sure expression columns and phenotype rows refer to the same samples.
ncol(expr_log)
## [1] 151
nrow(pheno)
## [1] 151
# Replace generic column names with GSM IDs from GEO
colnames(expr_log) <- pheno$geo_accession
head(colnames(expr_log))
## [1] "GSM2040792" "GSM2040793" "GSM2040794" "GSM2040795" "GSM2040796"
## [6] "GSM2040797"
Approach used here: parse the title
field.
If your dataset uses a different text pattern, change the
grepl() rule (this is the only fragile part).
group <- ifelse(grepl("case", pheno$title, ignore.case = TRUE), "COPD", "Control")
table(group)
## group
## Control COPD
## 40 111
group <- factor(group, levels = c("Control", "COPD"))
PCA is a sanity check: it tells you whether samples cluster by phenotype, whether outliers exist, and whether the main variance is disease-related or driven by other effects.
Scaling makes each probe contribute equally. This can help sometimes, but it can also over-weight noisy probes.
pca_scaled <- prcomp(t(expr_log), center = TRUE, scale. = TRUE)
pca_df <- data.frame(
PC1 = pca_scaled$x[, 1],
PC2 = pca_scaled$x[, 2],
group = group
)
ggplot(pca_df, aes(PC1, PC2, colour = group)) +
geom_point(size = 3, alpha = 0.85) +
stat_ellipse(aes(fill = group), geom = "polygon", alpha = 0.10, linetype = 2) +
theme_minimal(base_size = 13) +
labs(title = "PCA (scaled) - GSE76925", x = "PC1", y = "PC2", colour = "Group") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
cat(
"Interpretation (PCA — scaled):\n",
"- Each point is one sample; distance between points reflects overall transcriptomic similarity.\n",
"- PC1 and PC2 are the top two axes of variation in the dataset.\n",
"- If COPD and Control form visibly different clusters, that suggests a strong disease-driven expression signal.\n",
"- If they overlap heavily, the disease signal is weaker than other factors (e.g., batch effects, smoking status, cell-type composition).\n",
"- Outliers far from the main cloud may indicate technical artefacts or biologically distinct samples.\n",
sep = ""
)
## Interpretation (PCA — scaled):
## - Each point is one sample; distance between points reflects overall transcriptomic similarity.
## - PC1 and PC2 are the top two axes of variation in the dataset.
## - If COPD and Control form visibly different clusters, that suggests a strong disease-driven expression signal.
## - If they overlap heavily, the disease signal is weaker than other factors (e.g., batch effects, smoking status, cell-type composition).
## - Outliers far from the main cloud may indicate technical artefacts or biologically distinct samples.
No scaling preserves the natural variance structure across genes, which is often reasonable for log-transformed microarray data.
pca_noscale <- prcomp(t(expr_log), center = TRUE, scale. = FALSE)
pca_df2 <- data.frame(
PC1 = pca_noscale$x[, 1],
PC2 = pca_noscale$x[, 2],
group = group
)
ggplot(pca_df2, aes(PC1, PC2, colour = group)) +
geom_point(size = 3, alpha = 0.85) +
stat_ellipse(aes(fill = group), geom = "polygon", alpha = 0.12, linetype = 2) +
theme_minimal(base_size = 13) +
labs(title = "PCA (no scaling) - GSE76925", x = "PC1", y = "PC2", colour = "Group") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
cat(
"Interpretation (PCA — no scaling):\n",
"- This PCA keeps genes at their natural variance levels (high-variance genes influence PCs more).\n",
"- Often more realistic for microarrays because it preserves dominant biological variation.\n",
"- If separation improves here compared with the scaled PCA, it suggests a smaller subset of high-variance genes drives group structure.\n",
"- If separation worsens, scaling may have helped expose subtle disease signal.\n",
sep = ""
)
## Interpretation (PCA — no scaling):
## - This PCA keeps genes at their natural variance levels (high-variance genes influence PCs more).
## - Often more realistic for microarrays because it preserves dominant biological variation.
## - If separation improves here compared with the scaled PCA, it suggests a smaller subset of high-variance genes drives group structure.
## - If separation worsens, scaling may have helped expose subtle disease signal.
These tests help confirm whether disease status aligns with a major variance axis.
pc1_scores <- pca_noscale$x[, 1]
pc2_scores <- pca_noscale$x[, 2]
t.test(pc1_scores ~ group)
##
## Welch Two Sample t-test
##
## data: pc1_scores by group
## t = 0.1672, df = 63.565, p-value = 0.8677
## alternative hypothesis: true difference in means between group Control and group COPD is not equal to 0
## 95 percent confidence interval:
## -31.22036 36.92301
## sample estimates:
## mean in group Control mean in group COPD
## 2.0960079 -0.7553182
t.test(pc2_scores ~ group)
##
## Welch Two Sample t-test
##
## data: pc2_scores by group
## t = -6.519, df = 103.9, p-value = 2.602e-09
## alternative hypothesis: true difference in means between group Control and group COPD is not equal to 0
## 95 percent confidence interval:
## -46.74392 -24.93842
## sample estimates:
## mean in group Control mean in group COPD
## -26.34682 9.49435
df_pc1 <- data.frame(PC1 = pc1_scores, group = group)
df_pc2 <- data.frame(PC2 = pc2_scores, group = group)
ggplot(df_pc1, aes(group, PC1, fill = group)) +
geom_boxplot(alpha = 0.6) +
theme_minimal(base_size = 13) +
labs(title = "PC1 by group (no scaling)", x = NULL, y = "PC1") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
cat(
"Interpretation (PC1 by group):\n",
"- This tests whether the main axis of variation (PC1) differs between COPD and Control.\n",
"- A non-significant p-value means PC1 is likely driven by non-disease factors (e.g., technical batch, tissue composition, smoking).\n",
"- A significant p-value suggests COPD status contributes strongly to global variation in expression.\n",
sep = ""
)
## Interpretation (PC1 by group):
## - This tests whether the main axis of variation (PC1) differs between COPD and Control.
## - A non-significant p-value means PC1 is likely driven by non-disease factors (e.g., technical batch, tissue composition, smoking).
## - A significant p-value suggests COPD status contributes strongly to global variation in expression.
ggplot(df_pc2, aes(group, PC2, fill = group)) +
geom_boxplot(alpha = 0.6) +
theme_minimal(base_size = 13) +
labs(title = "PC2 by group (no scaling)", x = NULL, y = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
cat(
"\n\nInterpretation (PC2 by group):\n",
"- This tests whether the second major axis (PC2) is associated with COPD status.\n",
"- A significant p-value here implies a robust disease-related transcriptional axis, even if PC1 is dominated by other factors.\n",
"- This is common in human lung datasets where cell mixture/batch drives PC1 and disease signal appears on PC2 or PC3.\n",
sep = ""
)
##
##
## Interpretation (PC2 by group):
## - This tests whether the second major axis (PC2) is associated with COPD status.
## - A significant p-value here implies a robust disease-related transcriptional axis, even if PC1 is dominated by other factors.
## - This is common in human lung datasets where cell mixture/batch drives PC1 and disease signal appears on PC2 or PC3.
We model each probe as a function of COPD status. limma fits linear models efficiently and uses empirical Bayes shrinkage for stable inference.
design <- model.matrix(~ group)
colnames(design)
## [1] "(Intercept)" "groupCOPD"
fit <- lmFit(expr_log, design)
fit <- eBayes(fit)
results <- topTable(
fit,
coef = "groupCOPD",
number = Inf,
adjust.method = "BH"
)
head(results)
sum(results$adj.P.Val < 0.05)
## [1] 1812
# Visualising the top differentially expressed probes across samples
library(pheatmap)
library(RColorBrewer)
topN <- 50
top_probes <- rownames(results[order(results$adj.P.Val), ])[1:topN]
mat_top <- expr_log[top_probes, , drop = FALSE]
mat_z <- t(scale(t(mat_top))) # z-score per gene
# Annotation
ann_col <- data.frame(Group = group)
rownames(ann_col) <- colnames(mat_z)
# Custom colors
ann_colors <- list(
Group = c(COPD = "firebrick", Control = "steelblue") # adjust names to match your group levels
)
pheatmap(
mat_z,
annotation_col = ann_col,
annotation_colors = ann_colors,
color = colorRampPalette(rev(brewer.pal(9, "RdBu")))(100),
show_colnames = FALSE,
show_rownames = TRUE,
fontsize_row = 7,
fontsize = 11,
border_color = NA, # removes grid lines between cells
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
clustering_method = "complete",
main = paste("Top", topN, "DE Probes (Z-scored)"),
treeheight_row = 30,
treeheight_col = 30
)
cat(
"Interpretation (Heatmap of top DE probes):\n",
"- Rows are the top differentially expressed probes/genes; columns are samples.\n",
"- Values are z-scored per gene (row-wise), so colours show relative up/down expression within each gene.\n",
"- Separation of COPD vs Control in clustering suggests a coherent disease signature.\n",
"- Mixed clustering suggests heterogeneity or confounding effects.\n",
"- Isolated samples may be outliers and should be checked in PCA/QC.\n",
sep = ""
)
## Interpretation (Heatmap of top DE probes):
## - Rows are the top differentially expressed probes/genes; columns are samples.
## - Values are z-scored per gene (row-wise), so colours show relative up/down expression within each gene.
## - Separation of COPD vs Control in clustering suggests a coherent disease signature.
## - Mixed clustering suggests heterogeneity or confounding effects.
## - Isolated samples may be outliers and should be checked in PCA/QC.
res_df <- results %>%
dplyr::mutate(
ProbeID = rownames(results),
negLog10FDR = -log10(adj.P.Val),
sig = adj.P.Val < 0.05 & abs(logFC) > 0.5
)
library(ggplot2)
# Volcano Plot
ggplot(res_df, aes(x = logFC, y = negLog10FDR)) +
geom_point(aes(color = sig), alpha = 0.7, size = 2) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40", linewidth = 0.7) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40", linewidth = 0.7) +
scale_color_manual(
values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
labels = c("FALSE" = "Not Significant", "TRUE" = "Significant")
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA)
) +
labs(
title = "Volcano Plot — COPD vs Control",
x = "Log2 Fold Change",
y = "-log10(FDR)",
color = "Significance"
)
cat(
"Interpretation (Volcano Plot):\n",
"- Each point is one probe/gene.\n",
"- X-axis = log2 fold change (direction + magnitude of change in COPD vs Control).\n",
"- Y-axis = -log10(FDR), controlling for multiple testing.; higher points are more statistically significant.\n",
"- Points far left/right and high up are the most compelling candidates: large effect + strong significance.\n",
"- Dashed lines mark FDR = 0.05 (horizontal) and logFC = ±1 (vertical) cutoffs.\n",
"- A symmetric volcano suggests balanced up/down changes; strong skew may indicate global shifts or confounding.\n",
sep = ""
)
## Interpretation (Volcano Plot):
## - Each point is one probe/gene.
## - X-axis = log2 fold change (direction + magnitude of change in COPD vs Control).
## - Y-axis = -log10(FDR), controlling for multiple testing.; higher points are more statistically significant.
## - Points far left/right and high up are the most compelling candidates: large effect + strong significance.
## - Dashed lines mark FDR = 0.05 (horizontal) and logFC = ±1 (vertical) cutoffs.
## - A symmetric volcano suggests balanced up/down changes; strong skew may indicate global shifts or confounding.
# MA Plot
ggplot(res_df, aes(x = AveExpr, y = logFC)) +
geom_point(aes(color = sig), alpha = 0.7, size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40", linewidth = 0.7) +
geom_hline(yintercept = 1, linetype = "dotted", color = "firebrick", linewidth = 0.6) +
geom_hline(yintercept = -1, linetype = "dotted", color = "steelblue", linewidth = 0.6) +
scale_color_manual(
values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
labels = c("FALSE" = "Not Significant", "TRUE" = "Significant")
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA)
) +
labs(
title = "MA Plot — COPD vs Control",
x = "Average Expression",
y = "Log2 Fold Change",
color = "Significance"
)
cat(
"Interpretation (MA Plot):\n",
"- Each point is a probe/gene.\n",
"- X-axis = average expression (mean intensity across all samples).\n",
"- Y-axis = log2 fold change (COPD vs Control).\n",
"- This checks whether fold-changes depend on expression level.\n",
"- A funnel shape is normal (more noise at low expression).\n",
"- Dotted red/blue lines mark logFC = ±1 reference cutoffs.\n",
"- If extreme fold-changes occur only at very low expression, results may be noise-driven.\n",
sep = ""
)
## Interpretation (MA Plot):
## - Each point is a probe/gene.
## - X-axis = average expression (mean intensity across all samples).
## - Y-axis = log2 fold change (COPD vs Control).
## - This checks whether fold-changes depend on expression level.
## - A funnel shape is normal (more noise at low expression).
## - Dotted red/blue lines mark logFC = ±1 reference cutoffs.
## - If extreme fold-changes occur only at very low expression, results may be noise-driven.
top_up <- results[results$logFC > 0 & results$adj.P.Val < 0.05, ][1:10, ]
top_down <- results[results$logFC < 0 & results$adj.P.Val < 0.05, ][1:10, ]
top_up
top_down
The limma result rownames are Illumina probe IDs (for example, ILMN_1676938). Pathway tools usually expect gene symbols or Entrez IDs.
gpl <- getGEO("GPL10558")
gpl_tab <- Table(gpl)
# Build a mapping table from probe -> SYMBOL and Entrez
annot_map <- gpl_tab %>%
dplyr::transmute(
ProbeID = as.character(ID),
SYMBOL = trimws(as.character(Symbol)),
ENTREZID = trimws(as.character(Entrez_Gene_ID))
)
# Clean empty strings
annot_map$SYMBOL[annot_map$SYMBOL %in% c("", "NA")] <- NA
annot_map$ENTREZID[annot_map$ENTREZID %in% c("", "NA")] <- NA
# If SYMBOL has multiple mappings like "A /// B" or "A;B", keep the first
annot_map$SYMBOL <- sub("///.*$", "", annot_map$SYMBOL)
annot_map$SYMBOL <- sub(";.*$", "", annot_map$SYMBOL)
annot_map$SYMBOL <- trimws(annot_map$SYMBOL)
head(annot_map)
Definition used here: FDR < 0.05 and |logFC| > 0.5.
sig_strong <- results[results$adj.P.Val < 0.05 & abs(results$logFC) > 0.5, ]
sig_strong$ProbeID <- rownames(sig_strong)
annotated <- sig_strong %>%
dplyr::left_join(annot_map, by = "ProbeID")
head(annotated[, c("ProbeID", "SYMBOL", "ENTREZID", "logFC", "adj.P.Val")])
mean(!is.na(annotated$SYMBOL))
## [1] 0.9897523
mean(!is.na(annotated$ENTREZID))
## [1] 0.9897523
table(is.na(annotated$SYMBOL))
##
## FALSE TRUE
## 1159 12
ORA tests whether a set of “significant” genes is enriched for pathways compared with background.
We use Entrez IDs because GO and Reactome tools are more robust with Entrez than symbols.
entrez_up <- annotated %>%
dplyr::filter(logFC > 0) %>%
dplyr::pull(ENTREZID) %>%
na.omit() %>%
unique()
entrez_down <- annotated %>%
dplyr::filter(logFC < 0) %>%
dplyr::pull(ENTREZID) %>%
na.omit() %>%
unique()
length(entrez_up)
## [1] 69
length(entrez_down)
## [1] 1045
head(entrez_up)
## [1] "3357" "761" "348" "6507" "6363" "2266"
head(entrez_down)
## [1] "649214" "391359" "643431" "100130289" "344866" "81688"
ego_up <- enrichGO(
gene = entrez_up,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
ego_down <- enrichGO(
gene = entrez_down,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
head(as.data.frame(ego_up))
head(as.data.frame(ego_down))
dotplot(ego_up, showCategory = 15) +
ggtitle("GO BP - Upregulated in COPD") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
cat(
"Interpretation (GO BP enrichment — Upregulated in COPD):\n",
"- This is over-representation analysis on genes upregulated in COPD.\n",
"- Dot size = number of genes contributing to the term; colour = adjusted p-value (FDR).\n",
"- Terms related to extracellular matrix, immune activation, chemotaxis, and remodelling are biologically coherent for COPD.\n",
"- Highly redundant GO terms are common; focus on the consistent themes rather than individual term names.\n",
sep = ""
)
## Interpretation (GO BP enrichment — Upregulated in COPD):
## - This is over-representation analysis on genes upregulated in COPD.
## - Dot size = number of genes contributing to the term; colour = adjusted p-value (FDR).
## - Terms related to extracellular matrix, immune activation, chemotaxis, and remodelling are biologically coherent for COPD.
## - Highly redundant GO terms are common; focus on the consistent themes rather than individual term names.
dotplot(ego_down, showCategory = 15) +
ggtitle("GO BP - Downregulated in COPD") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
cat(
"\n\nInterpretation (GO BP enrichment — Downregulated in COPD):\n",
"- This is over-representation analysis on genes downregulated in COPD.\n",
"- Enrichment in RNA processing, splicing, translation, or ribosome-related terms often reflects suppression of baseline biosynthetic programmes.\n",
"- If few terms appear, it may mean: (i) fewer down genes, (ii) weaker effect sizes, or (iii) mapping drop-outs.\n",
sep = ""
)
##
##
## Interpretation (GO BP enrichment — Downregulated in COPD):
## - This is over-representation analysis on genes downregulated in COPD.
## - Enrichment in RNA processing, splicing, translation, or ribosome-related terms often reflects suppression of baseline biosynthetic programmes.
## - If few terms appear, it may mean: (i) fewer down genes, (ii) weaker effect sizes, or (iii) mapping drop-outs.
Reactome can be cleaner than GO for disease interpretation.
react_up <- enrichPathway(
gene = entrez_up,
organism = "human",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
react_down <- enrichPathway(
gene = entrez_down,
organism = "human",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
head(as.data.frame(react_up))
head(as.data.frame(react_down))
dotplot(react_up, showCategory = 15) + ggtitle("Reactome ORA - Up in COPD")
# Guard: do not plot if empty
if (nrow(as.data.frame(react_down)) == 0) {
cat("Reactome ORA (down) returned no significant pathways at q < 0.05.\n",
"This can happen if the down gene list is broad and non-specific, or if mapping reduces the list.\n")
} else {
dotplot(react_down, showCategory = 15) + ggtitle("Reactome ORA - Down in COPD")
}
## Reactome ORA (down) returned no significant pathways at q < 0.05.
## This can happen if the down gene list is broad and non-specific, or if mapping reduces the list.
cat(
"Interpretation (Reactome enrichment — Up in COPD):\n",
"- Reactome pathways are curated and often more interpretable than GO.\n",
"- Enrichment for ECM organisation, collagen degradation, integrin signalling, and immune pathways fits COPD tissue remodelling.\n",
"- Pathways with larger gene counts and low FDR are the strongest signals.\n",
sep = ""
)
## Interpretation (Reactome enrichment — Up in COPD):
## - Reactome pathways are curated and often more interpretable than GO.
## - Enrichment for ECM organisation, collagen degradation, integrin signalling, and immune pathways fits COPD tissue remodelling.
## - Pathways with larger gene counts and low FDR are the strongest signals.
cat(
"\n\nInterpretation (Reactome enrichment — Down in COPD):\n",
"- If this output is empty (0 rows), it usually means no Reactome pathways pass the chosen FDR cutoff.\n",
"- Common reasons: fewer mapped Entrez IDs for downregulated genes, weaker coordinated pathway structure, or conservative thresholds.\n",
"- You can explore with a relaxed cutoff (e.g., qvalueCutoff = 0.10) or switch to ranked GSEA (which does not require a hard DE cutoff).\n",
sep = ""
)
##
##
## Interpretation (Reactome enrichment — Down in COPD):
## - If this output is empty (0 rows), it usually means no Reactome pathways pass the chosen FDR cutoff.
## - Common reasons: fewer mapped Entrez IDs for downregulated genes, weaker coordinated pathway structure, or conservative thresholds.
## - You can explore with a relaxed cutoff (e.g., qvalueCutoff = 0.10) or switch to ranked GSEA (which does not require a hard DE cutoff).
GSEA uses the full ranked signal rather than a hard significance cut-off. This is often more stable.
Important: use the full mapping table from GPL, not only the significant subset.
gene_list <- results$t
entrez_map_full <- annot_map %>%
dplyr::select(ProbeID, ENTREZID) %>%
dplyr::filter(!is.na(ENTREZID))
names(gene_list) <- entrez_map_full$ENTREZID[match(rownames(results), entrez_map_full$ProbeID)]
This block prevents the common GSEA failures: NA names, duplicated Entrez IDs, and unsorted vectors.
# Must be numeric and named
stopifnot(is.numeric(gene_list))
stopifnot(!is.null(names(gene_list)))
# Remove NA and non-finite values
gene_list <- gene_list[is.finite(gene_list)]
gene_list <- gene_list[!is.na(names(gene_list))]
# Make names character (Entrez IDs)
names(gene_list) <- as.character(names(gene_list))
# Sort decreasing and drop duplicates (keep the best-ranked instance)
gene_list <- sort(gene_list, decreasing = TRUE)
gene_list <- gene_list[!duplicated(names(gene_list))]
# Final check: sorted decreasing
all(diff(gene_list) <= 0, na.rm = TRUE)
## [1] TRUE
length(gene_list)
## [1] 15718
head(gene_list)
## 3357 761 348 6507 6363 2266
## 5.038668 4.687792 4.685257 4.561199 4.363037 4.332848
gsea_reactome <- gsePathway(
geneList = gene_list,
organism = "human",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
verbose = FALSE
)
head(as.data.frame(gsea_reactome))
gsea_df <- as.data.frame(gsea_reactome)
if (is.null(gsea_reactome) || nrow(gsea_df) == 0) {
cat("No significant Reactome GSEA pathways at FDR < 0.05.\n",
"Try pvalueCutoff = 0.1, or double-check Entrez mapping.\n")
} else {
dotplot(gsea_reactome, showCategory = 15) +
ggtitle("Reactome GSEA - Ranked by limma t-statistic") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}
cat(
"Interpretation (Reactome GSEA — ranked by limma t-stat):\n",
"- This uses the full ranked gene list (by limma t-stat), not just a 'significant genes' cutoff.\n",
"- It detects pathways where many genes shift in a coordinated direction, even if individual genes are modest.\n",
"- NES (Normalised Enrichment Score) indicates direction:\n",
" * Positive NES = enriched at the top of the ranking (more up in COPD if ranking is COPD-positive).\n",
" * Negative NES = enriched at the bottom (more down in COPD).\n",
"- Very low adjusted p-values indicate strong, consistent pathway-level shifts.\n",
sep = ""
)
## Interpretation (Reactome GSEA — ranked by limma t-stat):
## - This uses the full ranked gene list (by limma t-stat), not just a 'significant genes' cutoff.
## - It detects pathways where many genes shift in a coordinated direction, even if individual genes are modest.
## - NES (Normalised Enrichment Score) indicates direction:
## * Positive NES = enriched at the top of the ranking (more up in COPD if ranking is COPD-positive).
## * Negative NES = enriched at the bottom (more down in COPD).
## - Very low adjusted p-values indicate strong, consistent pathway-level shifts.
library(enrichplot)
library(ggplot2)
if (!is.null(gsea_reactome) && nrow(as.data.frame(gsea_reactome)) > 0) {
gsea_df <- as.data.frame(gsea_reactome)
top_id <- gsea_df$ID[1]
top_desc <- gsea_df$Description[1]
gseaplot2(
gsea_reactome,
geneSetID = top_id,
title = paste("GSEA Enrichment —", top_desc),
color = "firebrick",
base_size = 13,
rel_heights = c(1.5, 0.4, 0.8)
)
} else {
message("No significant Reactome GSEA results found — skipping enrichment plot.")
}
cat(
"Interpretation (GSEA Enrichment Curve):\n",
"- The enrichment curve shows the running enrichment score as you move down the ranked gene list.\n",
"- Black tick marks show where genes from this pathway appear in the ranked list.\n",
"- A peak near the left = pathway genes concentrated among the most upregulated genes in COPD.\n",
"- A peak near the right = enrichment among downregulated genes.\n",
"- The bottom bar shows the ranked list metric (e.g. t-statistic or log2FC).\n",
"- This is often the most convincing single figure — it shows the enrichment signal directly.\n",
sep = ""
)
## Interpretation (GSEA Enrichment Curve):
## - The enrichment curve shows the running enrichment score as you move down the ranked gene list.
## - Black tick marks show where genes from this pathway appear in the ranked list.
## - A peak near the left = pathway genes concentrated among the most upregulated genes in COPD.
## - A peak near the right = enrichment among downregulated genes.
## - The bottom bar shows the ranked list metric (e.g. t-statistic or log2FC).
## - This is often the most convincing single figure — it shows the enrichment signal directly.
cat(
"Final Integrated Interpretation:\n\n",
"This transcriptomic analysis of COPD lung tissue reveals a coordinated molecular reprogramming characterised by extracellular matrix remodeling, altered SLIT–ROBO signaling, and suppression of core translational machinery.\n\n",
"PCA demonstrated that although COPD status does not dominate total variance (PC1), a significant disease-associated axis is present (PC2), indicating structured but heterogeneous transcriptional shifts.\n\n",
"Differential expression identified widespread dysregulation, and pathway-level enrichment confirmed that these changes are not random but reflect coordinated biological programs.\n\n",
"Rank-based Reactome GSEA further strengthened this conclusion by detecting consistent pathway-level shifts without arbitrary gene cutoffs, highlighting structural remodeling and translational suppression as dominant COPD-associated programs.\n\n",
"Together, these findings support a model in which COPD lung tissue undergoes chronic structural reorganization and stress-associated signaling reprogramming, rather than isolated gene-level perturbations.",
sep = ""
)
## Final Integrated Interpretation:
##
## This transcriptomic analysis of COPD lung tissue reveals a coordinated molecular reprogramming characterised by extracellular matrix remodeling, altered SLIT–ROBO signaling, and suppression of core translational machinery.
##
## PCA demonstrated that although COPD status does not dominate total variance (PC1), a significant disease-associated axis is present (PC2), indicating structured but heterogeneous transcriptional shifts.
##
## Differential expression identified widespread dysregulation, and pathway-level enrichment confirmed that these changes are not random but reflect coordinated biological programs.
##
## Rank-based Reactome GSEA further strengthened this conclusion by detecting consistent pathway-level shifts without arbitrary gene cutoffs, highlighting structural remodeling and translational suppression as dominant COPD-associated programs.
##
## Together, these findings support a model in which COPD lung tissue undergoes chronic structural reorganization and stress-associated signaling reprogramming, rather than isolated gene-level perturbations.
Keeping saved tables makes your GitHub repo reproducible.
dir.create("results", showWarnings = FALSE)
write.csv(annotated, file.path("results", "limma_sig_strong_annotated.csv"), row.names = FALSE)
write.csv(as.data.frame(ego_up), file.path("results", "GO_up_COPD.csv"), row.names = FALSE)
write.csv(as.data.frame(ego_down), file.path("results", "GO_down_COPD.csv"), row.names = FALSE)
write.csv(as.data.frame(react_up), file.path("results", "Reactome_ORA_up.csv"), row.names = FALSE)
write.csv(as.data.frame(react_down), file.path("results", "Reactome_ORA_down.csv"), row.names = FALSE)
write.csv(as.data.frame(gsea_reactome), file.path("results", "Reactome_GSEA.csv"), row.names = FALSE)
cat(
"Limitations:\n",
"- Microarray platforms have limited dynamic range compared to RNA-seq.\n",
"- Bulk lung tissue mixes multiple cell types, potentially confounding cell composition effects.\n",
"- Smoking status, inflammation level, and batch effects may contribute to dominant variance axes.\n",
"- Causality cannot be inferred from cross-sectional differential expression.\n",
"- Functional validation would be required to confirm pathway-level conclusions.\n",
sep = ""
)
## Limitations:
## - Microarray platforms have limited dynamic range compared to RNA-seq.
## - Bulk lung tissue mixes multiple cell types, potentially confounding cell composition effects.
## - Smoking status, inflammation level, and batch effects may contribute to dominant variance axes.
## - Causality cannot be inferred from cross-sectional differential expression.
## - Functional validation would be required to confirm pathway-level conclusions.
cat(
"\n\nClinical Context:\n",
"The observed enrichment of extracellular matrix and signaling pathways aligns with known COPD pathology involving airway remodeling, fibrosis, and altered epithelial–stromal communication.\n",
sep = ""
)
##
##
## Clinical Context:
## The observed enrichment of extracellular matrix and signaling pathways aligns with known COPD pathology involving airway remodeling, fibrosis, and altered epithelial–stromal communication.
cat(
"\n\nMethodological Summary:\n",
"Data were quality filtered using detection p-values, log2-transformed, analyzed using limma with empirical Bayes moderation, and interpreted using both over-representation analysis and rank-based GSEA to ensure robustness of pathway-level findings.\n",
sep = ""
)
##
##
## Methodological Summary:
## Data were quality filtered using detection p-values, log2-transformed, analyzed using limma with empirical Bayes moderation, and interpreted using both over-representation analysis and rank-based GSEA to ensure robustness of pathway-level findings.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.1.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 ReactomePA_1.50.0 enrichplot_1.26.6
## [4] org.Hs.eg.db_3.20.0 AnnotationDbi_1.68.0 IRanges_2.40.1
## [7] S4Vectors_0.44.0 clusterProfiler_4.14.6 limma_3.62.2
## [10] GEOquery_2.74.0 Biobase_2.66.0 BiocGenerics_0.52.0
## [13] pheatmap_1.0.13 data.table_1.17.0 lubridate_1.9.4
## [16] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [19] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
## [22] tibble_3.2.1 ggplot2_4.0.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.17.1 jsonlite_2.0.0
## [3] magrittr_2.0.4 ggtangle_0.1.1
## [5] farver_2.1.2 rmarkdown_2.29
## [7] fs_1.6.6 zlibbioc_1.52.0
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] ggtree_3.14.0 htmltools_0.5.9
## [13] S4Arrays_1.6.0 curl_6.2.2
## [15] SparseArray_1.6.2 gridGraphics_0.5-1
## [17] sass_0.4.10 bslib_0.10.0
## [19] httr2_1.1.2 plyr_1.8.9
## [21] cachem_1.1.0 igraph_2.2.2
## [23] lifecycle_1.0.5 pkgconfig_2.0.3
## [25] Matrix_1.7-3 R6_2.6.1
## [27] fastmap_1.2.0 gson_0.1.0
## [29] GenomeInfoDbData_1.2.13 MatrixGenerics_1.18.1
## [31] digest_0.6.39 aplot_0.2.9
## [33] colorspace_2.1-1 patchwork_1.3.2
## [35] GenomicRanges_1.58.0 RSQLite_2.3.9
## [37] labeling_0.4.3 timechange_0.3.0
## [39] polyclip_1.10-7 httr_1.4.7
## [41] abind_1.4-8 compiler_4.4.2
## [43] bit64_4.6.0-1 withr_3.0.2
## [45] graphite_1.52.0 S7_0.2.0
## [47] BiocParallel_1.40.0 viridis_0.6.5
## [49] DBI_1.2.3 ggforce_0.5.0
## [51] R.utils_2.13.0 MASS_7.3-65
## [53] rappdirs_0.3.4 DelayedArray_0.32.0
## [55] tools_4.4.2 ape_5.8-1
## [57] rentrez_1.2.4 R.oo_1.27.1
## [59] glue_1.8.0 nlme_3.1-168
## [61] GOSemSim_2.32.0 grid_4.4.2
## [63] reshape2_1.4.4 fgsea_1.32.4
## [65] generics_0.1.3 gtable_0.3.6
## [67] tzdb_0.5.0 R.methodsS3_1.8.2
## [69] hms_1.1.3 tidygraph_1.3.1
## [71] xml2_1.3.8 XVector_0.46.0
## [73] ggrepel_0.9.6 pillar_1.10.1
## [75] yulab.utils_0.2.4 splines_4.4.2
## [77] tweenr_2.0.3 treeio_1.30.0
## [79] lattice_0.22-6 bit_4.6.0
## [81] tidyselect_1.2.1 GO.db_3.20.0
## [83] Biostrings_2.74.1 reactome.db_1.89.0
## [85] knitr_1.50 gridExtra_2.3
## [87] SummarizedExperiment_1.36.0 xfun_0.51
## [89] graphlayouts_1.2.3 statmod_1.5.1
## [91] matrixStats_1.5.0 stringi_1.8.7
## [93] UCSC.utils_1.2.0 lazyeval_0.2.2
## [95] ggfun_0.2.0 yaml_2.3.10
## [97] evaluate_1.0.3 codetools_0.2-20
## [99] ggraph_2.2.2 qvalue_2.38.0
## [101] graph_1.84.1 BiocManager_1.30.25
## [103] ggplotify_0.1.3 cli_3.6.5
## [105] jquerylib_0.1.4 Rcpp_1.1.1
## [107] GenomeInfoDb_1.42.3 png_0.1-8
## [109] XML_3.99-0.18 parallel_4.4.2
## [111] blob_1.2.4 DOSE_4.0.1
## [113] viridisLite_0.4.2 tidytree_0.4.7
## [115] scales_1.4.0 crayon_1.5.3
## [117] rlang_1.1.7 cowplot_1.2.0
## [119] fastmatch_1.1-8 KEGGREST_1.46.0